Abstract

Several approximate policy iteration schemes without value functions,
which focus on policy representation using classifiers and address policy
learning as a supervised learning problem, have been proposed recently.
Finding good policies with such methods requires not only an appropriate
classifier, but also reliable examples of best actions, covering the state
space sufficiently. Up to this time, little work has been done on appropriate
covering schemes and on methods for reducing the sample complexity
of such methods, especially in continuous state spaces. This paper focuses
on the simplest possible covering scheme (a discretized grid over
the state space) and performs a sample-complexity comparison between
the simplest (and previously commonly used) rollout sampling allocation
strategy, which allocates samples equally at each state under consideration,
and an almost as simple method, which allocates samples only as
needed and requires significantly fewer samples.



1 Introduction 

Supervised and reinforcement learning are two well-known learning paradigms,
which have been researched mostly independently. Recent studies have
investigated using mature supervised learning methods for reinforcement
learning [9, 6, 10, 7]. Initial results have shown that policies can be approximately
represented using multi-class classifiers and therefore it is possible to incorporate
classification algorithms within the inner loops of several reinforcement learning
algorithms [9, 6, 7]. This viewpoint allows the quantification of the performance
of reinforcement learning algorithms in terms of the performance of
classification algorithms [10]. While a variety of promising combinations become possible
through this synergy, heretofore there have been limited practical results and
widely-applicable algorithms.

Herein we consider approximate policy iteration algorithms, such as those 
proposed by Lagoudakis and Parr [9] as well as Fern et al. [6, 7], which do not 
explicitly represent a value function. At each iteration, a new policy/classifier is 
produced using training data obtained through extensive simulation (rollouts) 
of the previous policy on a generative model of the process. These rollouts aim 
at identifying better action choices over a subset of states in order to form a set 
of data for training the classifier representing the improved policy. The major
limitation of these algorithms, as also indicated by Lagoudakis and Parr [9],
is the large amount of rollout sampling employed at each sampled state. It
is hinted, however, that great improvement could be achieved with sophisticated
management of sampling. We have verified this intuition in a companion
paper [4] that experimentally compared the original approach of uninformed
uniform sampling with various intelligent sampling techniques. That paper
employed heuristic variants of well-known algorithms for bandit problems, such as
Upper Confidence Bounds [1] and Successive Elimination [5], for the purpose of
managing rollouts (choosing which state to sample from is similar to choosing
which lever to pull on a bandit machine). It should be noted, however, that
despite the similarity, rollout management differs substantially from standard
bandit problems and thus general bandit results are not directly applicable to
our case.

*This project was partially supported by the ICIS-IAS project and the Marie Curie
International Reintegration Grant MCIRG-CT-2006-044980 awarded to Michail G. Lagoudakis.

The current paper aims to offer a first theoretical insight into the rollout
sampling problem. This is done through the analysis of the two simplest sample
allocation methods described in [4]: first, the old method that simply allocates
an equal, fixed number of samples at each state and, second, the slightly more
sophisticated method of progressively sampling all states where we are not yet
reasonably certain which action would be the policy-improving one.

The remainder of the paper is organised as follows. Section 2 provides the
necessary background, Section 3 discusses related work, and Section 4 introduces
the proposed algorithms. Section 5, which contains an analysis of the
proposed algorithms, is the main technical contribution.



2 Preliminaries 

A Markov Decision Process (MDP) is a 6-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, D)$, where $\mathcal{S}$ is
the state space of the process, $\mathcal{A}$ is a finite set of actions, $P$ is a Markovian
transition model ($P(s, a, s')$ denotes the probability of a transition to state $s'$
when taking action $a$ in state $s$), $R$ is a reward function ($R(s, a)$ is the expected
reward for taking action $a$ in state $s$), $\gamma \in (0, 1]$ is the discount factor for future
rewards, and $D$ is the initial state distribution. A deterministic policy $\pi$ for an
MDP is a mapping $\pi : \mathcal{S} \mapsto \mathcal{A}$ from states to actions; $\pi(s)$ denotes the action
choice at state $s$. The value $V^\pi(s)$ of a state $s$ under a policy $\pi$ is the expected,
total, discounted reward when the process begins in state $s$ and all decisions at
all steps are made according to $\pi$:



$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \,\middle|\, s_0 = s,\ s_t \sim P\right] \qquad (1)$$



The goal of the decision maker is to find an optimal policy $\pi^*$ that maximises
the expected, total, discounted reward from all states; in other words,
$V^{\pi^*}(s) \ge V^\pi(s)$ for all policies $\pi$ and all states $s \in \mathcal{S}$.

Policy iteration (PI) is an efficient method for deriving an optimal policy.
It generates a sequence $\pi_1, \pi_2, \ldots, \pi_k$ of gradually improving policies, which
terminates when there is no change in the policy ($\pi_k = \pi_{k-1}$); $\pi_k$ is an optimal
policy. Improvement is achieved by computing $V^{\pi_i}$ analytically (solving the
linear Bellman equations) and the action values

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') V^{\pi_i}(s'),$$

and then determining the improved policy as $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$.
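The improvement loop described above can be sketched for a small tabular MDP. This is only an illustration under stated assumptions: the nested-list layout `P[s][a][s2]` and `R[s][a]`, and the names `evaluate_policy` and `policy_iteration`, are hypothetical, and the evaluation step uses fixed-point iteration (valid for $\gamma < 1$) in place of an exact linear solve.

```python
from typing import List, Tuple

Transition = List[List[List[float]]]  # P[s][a][s2], assumed layout
Reward = List[List[float]]            # R[s][a], assumed layout


def evaluate_policy(P: Transition, R: Reward, policy: List[int],
                    gamma: float, tol: float = 1e-10) -> List[float]:
    """Approximate V^pi by iterating V <- R_pi + gamma * P_pi V (needs gamma < 1)."""
    n = len(policy)
    V = [0.0] * n
    while True:
        V_new = [
            R[s][policy[s]]
            + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][policy[s]]))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def policy_iteration(P: Transition, R: Reward,
                     gamma: float) -> Tuple[List[int], List[float]]:
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(R), len(R[0])
    policy = [0] * n_states
    while True:
        V = evaluate_policy(P, R, policy, gamma)
        # Action values Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s').
        Q = [
            [R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        new_policy = [max(range(n_actions), key=lambda a: Q[s][a])
                      for s in range(n_states)]
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

On a two-state example where only staying in state 1 is rewarded, the loop should settle on moving to state 1 and then staying there.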

Policy iteration typically terminates in a small number of steps. However,
it relies on knowledge of the full MDP model, exact computation and
representation of the value function of each policy, and exact representation of each
policy. Approximate policy iteration (API) is a family of methods that have
been suggested to address the "curse of dimensionality", that is, the huge growth
in complexity as the problem grows. In API, value functions and policies are
represented approximately in some compact form, but the iterative improvement
process remains the same. Evidently, the guarantees for monotonic
improvement, optimality, and convergence are compromised. API may never converge;
in practice, however, it reaches good policies in only a few iterations.

2.1 Rollout estimates 

Typically, API employs some representation of the MDP model to compute the
value function and derive the improved policy. On the other hand, the
Monte-Carlo estimation technique of rollouts provides a way of accurately estimating
$Q^\pi$ at any given state-action pair $(s, a)$ without requiring an explicit MDP
model or representation of the value function. Instead, a generative model of
the process (a simulator) is used; such a model takes a state-action pair $(s, a)$
and returns a reward $r$ and a next state $s'$ sampled from $R(s, a)$ and $P(s, a, s')$
respectively.

A rollout for the state-action pair $(s, a)$ amounts to simulating a single
trajectory of the process beginning from state $s$, choosing action $a$ for the first step,
and choosing actions according to the policy $\pi$ thereafter up to a certain
horizon $T$. If we denote the sequence of collected rewards during the $i$-th simulated
trajectory as $r_t^{(i)}$, $t = 0, 1, 2, \ldots, T-1$, then the rollout estimate $\hat{Q}_K^{\pi,T}(s, a)$ of
the true state-action value function $Q^\pi(s, a)$ is the observed total discounted
reward, averaged over all $K$ trajectories:

$$\hat{Q}_K^{\pi,T}(s, a) \triangleq \frac{1}{K} \sum_{i=1}^{K} \tilde{Q}_{(i)}^{\pi,T}(s, a), \qquad \tilde{Q}_{(i)}^{\pi,T}(s, a) \triangleq \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}.$$

Similarly, we define $Q^{\pi,T}(s, a) = E\left(\sum_{t=0}^{T-1} \gamma^t r_t \mid a_0 = a, s_0 = s, a_t \sim \pi, s_t \sim P\right)$
to be the actual state-action value function up to horizon $T$. As will be seen
later, with a sufficient amount of rollouts and a long horizon $T$, we can create
an improved policy $\pi'$ from $\pi$ at any state $s$, without requiring a model of the
MDP.
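The rollout estimate above can be sketched as follows. This is a minimal illustration, assuming a simulator callable that returns `(next_state, reward)`; the names `rollout_estimate` and `toy_simulate` are hypothetical and the toy dynamics are invented purely for the example.

```python
import random
from typing import Callable, Tuple

Simulator = Callable[[float, int], Tuple[float, float]]  # (s, a) -> (s', r)


def rollout_estimate(simulate: Simulator, policy: Callable[[float], int],
                     s: float, a: int, gamma: float, T: int, K: int) -> float:
    """Average of K truncated discounted returns starting with action a in s."""
    total = 0.0
    for _ in range(K):
        ret, x, act = 0.0, s, a
        for t in range(T):
            x, r = simulate(x, act)   # one step of the generative model
            ret += (gamma ** t) * r   # discounted reward at step t
            act = policy(x)           # follow the policy after the first step
        total += ret
    return total / K


def toy_simulate(x: float, a: int) -> Tuple[float, float]:
    """Hypothetical 1-D process on [0, 1]: a in {-1, +1} nudges the state,
    and the reward equals the new state."""
    nxt = min(1.0, max(0.0, x + 0.1 * a + random.gauss(0.0, 0.01)))
    return nxt, nxt
```

With a simulator that always pays reward 1, the estimate reduces to the geometric sum $\sum_{t<T} \gamma^t$, which gives a quick sanity check.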

3 Related work 

Rollout estimates have been used in the Rollout Classification Policy Iteration
(RCPI) algorithm [9], which has yielded promising results in several learning
domains. However, as stated therein, it is sensitive to the distribution of training
states over the state space. For this reason it is suggested to draw states from
the discounted future state distribution of the improved policy. This
tricky-to-sample distribution, also used by Fern et al. [7], yields better results. One
explanation advanced in those studies is the reduction of the potential mismatch
between the training and testing distributions of the classifier.

However, in both cases, and irrespective of the sampling distribution, the
main drawback is the excessive computational cost due to the need for lengthy
and repeated rollouts to reach a good level of accuracy in the estimation of the
value function. In our preliminary experiments with RCPI, it has been observed
that most of the effort is spent where the action value differences are either
non-existent, or so fine that they require a prohibitive number of rollouts to identify
them. In this paper, we propose and analyse sampling methods to remove this
performance bottleneck. By restricting the sampling distribution to the case
of a uniform grid, we compare the fixed allocation algorithm (Fixed) [9, 7],
whereby a large fixed amount of rollouts is used for estimating the action values
in each training state, to a simple incremental sampling scheme based on
counting (Count), where the amount of rollouts in each training state varies. We
then derive complexity bounds, which show a clear improvement using Count
that depends only on the structure of differential value functions.

We note that Fern et al. [7] presented a related analysis. While they go
into considerably more depth with respect to the classifier, their results are not
applicable to our framework. This is because they assume that there exists
some real number $\Delta^* > 0$ which lower-bounds the amount by which the value
of an optimal action(s) under any policy exceeds the value of the nearest
sub-optimal action in any state $s$. Furthermore, the algorithm they analyse uses a
fixed number of rollouts at each sampled state. For a given minimum $\Delta^*$ value
over all states, they derive the necessary number of rollouts per state to
guarantee an improvement step with high probability, but the algorithm offers no
practical way to guarantee a high probability improvement. We instead derive
error bounds for the fixed and counting allocation algorithms. Additionally, we
are considering continuous, rather than discrete, state spaces. Because of this,
technically our analysis is much more closely related to that of Auer et al. [2].

4 Algorithms to reduce sampling cost 

The total sampling cost depends on the balance between the number of states
sampled and the number of samples per state. In the fixed allocation scheme [9,
7], the same number of $K|\mathcal{A}|$ rollouts is allocated to each state in a subset $S$
of states and all $K$ rollouts dedicated to a single action are exhausted before
moving on to the next action. Intuitively, if the desired outcome (superiority
of some action) in some state can be confidently determined early, there is
no need to exhaust all $K|\mathcal{A}|$ rollouts available in that state; the training data
could be stored and the state could be removed from the pool without further
examination. Similarly, if we can confidently determine that all actions are
indifferent in some state, we can simply reject it without wasting any more
rollouts; such rejected states could be replaced by fresh ones which might yield
meaningful results. These ideas lead to the following question: can we examine
all states in $S$ collectively in some interleaved manner by selecting each time a
single state to focus on and allocating rollouts only as needed?
Algorithm 1 SampleState

Input: state $s$, policy $\pi$, horizon $T$, discount factor $\gamma$
for each $a \in \mathcal{A}$ do
    $(s', r) = \text{Simulate}(s, a)$
    $\tilde{Q}^\pi(s, a) = r$
    $x = s'$
    for $t = 1$ to $T - 1$ do
        $(x', r) = \text{Simulate}(x, \pi(x))$
        $\tilde{Q}^\pi(s, a) = \tilde{Q}^\pi(s, a) + \gamma^t r$
        $x = x'$
    end for
end for
return $\tilde{Q}^\pi$



Selecting states from the state pool could be viewed as a problem akin to a
multi-armed bandit problem, where each state corresponds to an arm. Pulling
a lever corresponds to sampling the corresponding state once. By sampling a
state we mean that we perform a single rollout for each action in that state
as shown in Algorithm 1. This is the minimum amount of information we can
request from a single state. Thus, the problem is transformed to a variant of
the classic multi-armed bandit problem. Several methods have been proposed
for various versions of this problem, which could potentially be used in this
context. In this paper, apart from the fixed allocation scheme presented above,
we also examine a simple counting scheme.

The algorithms presented here maintain an empirical estimate $\hat{\Delta}^\pi(s)$ of the
marginal difference of the apparently maximal and the second best of actions.
This can be represented by the marginal difference in $Q^\pi$ values in state $s$,
defined as

$$\Delta^\pi(s) = Q^\pi(s, a^*_{s,\pi}) - \max_{a \ne a^*_{s,\pi}} Q^\pi(s, a),$$

where $a^*_{s,\pi}$ is the action that maximises $Q^\pi$ in state $s$:

$$a^*_{s,\pi} = \arg\max_{a \in \mathcal{A}} Q^\pi(s, a).$$

The case of multiple equivalent maximising actions can be easily handled by
generalising to sets of actions in the manner of Fern et al. [7], in particular

$$A^*_{s,\pi} \triangleq \{a \in \mathcal{A} : Q^\pi(s, a) \ge Q^\pi(s, a'),\ \forall a' \in \mathcal{A}\},$$
$$V^\pi_*(s) \triangleq \max_{a \in \mathcal{A}} Q^\pi(s, a),$$
$$\Delta^\pi(s) = \begin{cases} V^\pi_*(s) - \max_{a \notin A^*_{s,\pi}} Q^\pi(s, a), & A^*_{s,\pi} \subset \mathcal{A} \\ 0, & A^*_{s,\pi} = \mathcal{A}. \end{cases}$$

However, here we discuss only the single best action case to simplify the
exposition. The estimate $\hat{\Delta}^\pi(s)$ is defined using the empirical value function $\hat{Q}^\pi(s, a)$.

^1 It is possible to also manage sampling of the actions, but herein we are only concerned
with the effort saved by managing state sampling.
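As a small illustration of the margin definition, the following sketch computes the set of maximising actions and the margin from a table of (true or empirical) action values. The function name `action_margin` and the dictionary input are assumptions made for this example; ties are detected by exact equality, which is only sensible for illustration.

```python
from typing import Dict, Hashable, Set, Tuple


def action_margin(q_values: Dict[Hashable, float]) -> Tuple[Set[Hashable], float]:
    """Return the set of maximising actions and the margin to the best
    non-maximising action (0 if every action is a maximiser)."""
    best = max(q_values.values())
    maximisers = {a for a, q in q_values.items() if q == best}
    rest = [q for a, q in q_values.items() if a not in maximisers]
    margin = best - max(rest) if rest else 0.0
    return maximisers, margin
```

For action values 1.0, 0.4 and 0.7 the margin is the gap between 1.0 and 0.7; when all values coincide the margin is zero, matching the second case of the set-valued definition.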
5 Complexity of sampling-based policy improvement



Rollout algorithms can be used for policy improvement under certain conditions.
Bertsekas [3] gives several theorems for policy iteration using rollouts and an
approximate value function that satisfies a consistency property. Specifically,
Proposition 3.1 therein states that the one-step look-ahead policy $\pi'$ computed
from the approximate value function $\tilde{V}^\pi$ has a value function which is
better than the current approximation $\tilde{V}^\pi$, if
$\max_{a \in \mathcal{A}} E[r_{t+1} + \gamma \tilde{V}^\pi(s_{t+1}) \mid \pi', s_t = s, a_t = a] \ge \tilde{V}^\pi(s)$ for all $s \in \mathcal{S}$. It is easy to see that an approximate value
function that uses only sampled trajectories from a fixed policy $\pi$ satisfies this
property if we have an adequate number of samples. While this assures us that
we can perform rollouts at any state in order to improve upon the given policy,
it does not lend itself directly to policy iteration. That is, with no way to
compactly represent the resulting rollout policy we would be limited to performing
deeper and deeper tree searches in rollouts.

In this section we shall give conditions that allow policy iteration through
compact representation of rollout policies via a grid and a finite number of
sampled states and sample trajectories with a finite horizon. Following this, we
will analyse the complexity of the fixed sampling allocation scheme employed in
[9, 7] and compare it with an oracle that needs only one sample to determine
$a^*_{s,\pi}$ for any $s \in \mathcal{S}$ and a simple counting scheme.

5.1 Sufficient conditions 

Assumption 1 (Bounded finite-dimension state space) The state space
$\mathcal{S}$ is a compact subset of $[0, 1]^d$.

This assumption can be generalised to other bounded state spaces easily.
However, it is necessary to have this assumption in order to be able to place some
minimal constraints on the search.

Assumption 2 (Bounded rewards) $R(s, a) \in [0, 1]$ for all $a \in \mathcal{A}$, $s \in \mathcal{S}$.

This assumption bounds the reward function and can also be generalised easily
to other bounding intervals.

Assumption 3 (Hölder Continuity) For any policy $\pi \in \Pi$, there exists $L, \alpha \in
[0, 1]$, such that for all states $s, s' \in \mathcal{S}$

$$|Q^\pi(s, a) - Q^\pi(s', a)| \le \frac{L}{2} \|s - s'\|_\infty^\alpha.$$

This assumption ensures that the value function $Q^\pi$ is fairly smooth. It trivially
follows in conjunction with Assumptions 1 and 2 that $Q^\pi, \Delta^\pi$ are bounded
everywhere in $\mathcal{S}$ if they are bounded for at least one $s \in \mathcal{S}$. Furthermore, the
following holds:

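Remark 5.1 Given that, by definition, $Q^\pi(s, a^*_{s,\pi}) \ge \Delta^\pi(s) + Q^\pi(s, a)$ for all
$a \ne a^*_{s,\pi}$, it follows from Assumption 3 that

$$Q^\pi(s', a^*_{s,\pi}) \ge Q^\pi(s', a),$$

for all $s' \in \mathcal{S}$ such that $\|s - s'\|_\infty \le \sqrt[\alpha]{\Delta^\pi(s)/L}$.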

This remark implies that the best action in some state $s$ according to $Q^\pi$ will also
be the best action in a neighbourhood of states around $s$. This is a reasonable
condition as there would be no chance of obtaining a reasonable estimate of the
best action in any region from a single point, if $Q^\pi$ could change arbitrarily fast.
We assert that MDPs with a similar smoothness property on their transition
distribution will also satisfy this assumption.

Finally, we need an assumption that limits the total number of rollouts that
we need to take, as states with a smaller $\Delta^\pi$ will need more rollouts.

Assumption 4 (Measure) If $\mu\{S\}$ denotes the Lebesgue measure of set $S$,
then, for any $\pi \in \Pi$, there exist $M, \beta > 0$ such that $\mu\{s \in \mathcal{S} : \Delta^\pi(s) < \epsilon\} < M\epsilon^\beta$
for all $\epsilon > 0$.

This assumption effectively limits the amount of times value-function changes
lead to best-action changes, as well as the ratio of states where the action
values are close. This assumption, together with the Hölder continuity assumption,
imposes a certain structure on the space of value functions. We are thus
guaranteed that the value function of any policy results in an improved policy which
is not arbitrarily complex. This, in turn, implies that an optimal policy cannot
be arbitrarily complex either.

A final difficulty is determining whether there exists some sufficient horizon
$T_0$ beyond which it is unnecessary to go. Unfortunately, even though for any
state $s$ for which $Q^\pi(s, a') > Q^\pi(s, a)$, there exists $T_0(s)$ such that $Q^{\pi,T}(s, a') >
Q^{\pi,T}(s, a)$ for all $T > T_0(s)$, $T_0$ grows without bound as we approach a point
where the best action changes. However, by selecting a fixed, sufficiently large
rollout horizon, we can still behave optimally with respect to the true value
function in a compact subset of $\mathcal{S}$.

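Lemma 5.1 For any policy $\pi \in \Pi$, $\epsilon > 0$, there exists a finite $T_\epsilon > 0$ and a
compact subset $S_\epsilon \subset \mathcal{S}$ such that

$$Q^{\pi,T}(s, a^*_{s,\pi}) \ge Q^{\pi,T}(s, a) \quad \forall a \in \mathcal{A},\ s \in S_\epsilon,\ T > T_\epsilon,$$

where $a^*_{s,\pi} \in \mathcal{A}$ is such that $Q^\pi(s, a^*_{s,\pi}) \ge Q^\pi(s, a)$ for all $a \in \mathcal{A}$.

Proof From the above assumptions it follows directly that for any $\epsilon > 0$, there
exists a compact set of states $S_\epsilon \subset \mathcal{S}$ such that $Q^\pi(s, a^*_{s,\pi}) \ge Q^\pi(s, a') + \epsilon$ for
all $s \in S_\epsilon$, with $a' = \arg\max_{a \ne a^*_{s,\pi}} Q^\pi(s, a)$. Now let $x_T \triangleq Q^{\pi,T}(s, a^*_{s,\pi}) -
Q^{\pi,T}(s, a')$. Then, $x_\infty \triangleq \lim_{T \to \infty} x_T \ge \epsilon$. For any $s \in S_\epsilon$ the limit exists and
thus by definition $\exists T_\epsilon(s)$ such that $x_T > 0$ for all $T > T_\epsilon(s)$. Since $S_\epsilon$ is compact,
$T_\epsilon = \sup_{s \in S_\epsilon} T_\epsilon(s)$ also exists.^2 □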

This ensures that we can identify the best action within $\epsilon$, using a finite rollout
horizon, in most of $\mathcal{S}$. Moreover, $\mu\{S_\epsilon\} \ge 1 - M\epsilon^\beta$ from Assumption 4.

In standard policy iteration, the improved policy $\pi'$ over $\pi$ has the property
that the improved action in any state is the action with the highest $Q^\pi$ value in
that state. However, in rollout-based policy iteration, we may only guarantee
being within $\epsilon > 0$ of the maximally improved policy.

^2 For a discount factor $\gamma < 1$ we can simply bound $T_\epsilon$ with $\log[\epsilon(1 - \gamma)]/\log(\gamma)$.
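The horizon bound in the footnote can be evaluated directly. The sketch below assumes rewards in $[0, 1]$ (Assumption 2), so that the rewards discarded after step $T$ sum to at most $\gamma^T/(1-\gamma)$; requiring this tail to be at most $\epsilon$ gives $T \ge \log[\epsilon(1-\gamma)]/\log\gamma$. The function name `horizon_bound` is hypothetical.

```python
import math


def horizon_bound(eps: float, gamma: float) -> int:
    """Smallest integer T with gamma**T / (1 - gamma) <= eps, for 0 < gamma < 1."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("the bound only applies for 0 < gamma < 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    # log(gamma) is negative, so the ratio is positive whenever eps*(1-gamma) < 1.
    return max(0, math.ceil(math.log(eps * (1.0 - gamma)) / math.log(gamma)))
```

For example, with $\epsilon = 0.1$ and $\gamma = 0.9$ the bound evaluates to 44 steps, and it grows quickly as $\gamma$ approaches 1.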
Algorithm 2 Oracle

Input: $n$, $\pi$
Set $\hat{S}$ to a uniform grid of $n$ states in $\mathcal{S}$.
for $s \in \hat{S}$ do
    $a^*_{s,\pi} = \arg\max_a Q^\pi(s, a)$
end for
return $\hat{A}^*_{\hat{S},\pi} \triangleq \{a^*_{s,\pi} : s \in \hat{S}\}$



Definition 5.1 ($\epsilon$-improved policy) An $\epsilon$-improved policy $\pi'$ derived from $\pi$
satisfies

$$\max_{a \in \mathcal{A}} Q^\pi(s, a) - \epsilon \le V^{\pi'}(s). \qquad (2)$$

Such a policy will be said to be improving in $S$ if $V^\pi(s) \le V^{\pi'}(s)$ for all $s \in S$.
The measure of states for which there cannot be improvement is limited by
Assumption 4. Finding an improved $\pi'$ for the whole of $\mathcal{S}$ is in fact not possible
in finite time, since this requires determining the boundaries in $\mathcal{S}$ at which the
best action changes.^3

In all cases, we shall attempt to find the improving action $a^*_{s,\pi}$ at each
state $s$ on a uniform grid of $n$ states, with the next policy $\pi'(s')$ taking the
estimated best action $\hat{a}^*_{s,\pi}$ for the state $s$ closest to $s'$, i.e. it is a
nearest-neighbour classifier.
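A grid-based nearest-neighbour policy of this kind can be sketched in a few lines. This assumes the state space is the unit cube $[0, 1]^d$ and that grid points are placed at cell centres; the names `uniform_grid` and `nearest_neighbour_policy` are hypothetical, and the linear scan over grid points is chosen for clarity rather than speed.

```python
import itertools
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Point = Tuple[float, ...]


def uniform_grid(m: int, d: int) -> List[Point]:
    """Centres of the m**d equal cells tiling [0, 1]^d.
    Adjacent points are 1/m apart, so each cell has half-width rho = 1/(2m)."""
    axis = [(i + 0.5) / m for i in range(m)]
    return list(itertools.product(axis, repeat=d))


def nearest_neighbour_policy(
        grid_actions: Dict[Point, Hashable]) -> Callable[[Sequence[float]], Hashable]:
    """Policy that copies the action stored at the closest grid state,
    with distances measured in the infinity norm."""
    def policy(s: Sequence[float]) -> Hashable:
        nearest = min(grid_actions,
                      key=lambda g: max(abs(x - y) for x, y in zip(g, s)))
        return grid_actions[nearest]
    return policy
```

With `m` points per axis the grid has $n = m^d$ states, which matches the spacing $2\rho = 1/n^{1/d}$ used in the analysis.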

In the remainder, we derive complexity bounds for achieving an $\epsilon$-improved
policy $\pi'$ from $\pi$ with probability at least $1 - \delta$. We shall always assume that we
are using a sufficiently deep rollout to cover $S_\epsilon$ and only consider the number
of rollouts performed. First, we shall derive the number of states we need to
sample from in order to guarantee an $\epsilon$-improved policy, under the assumption
that at each state we have an oracle which can give us the exact $Q^\pi$ values for
each state we examine. Later, we shall consider sample complexity bounds for
the case where we do not have an oracle, but use empirical estimates $\hat{Q}^{\pi,T}$ at
each state.



5.2 The Oracle algorithm 

Let $B(s, \rho)$ denote the infinity-norm sphere of radius $\rho$ centred in $s$ and
consider Alg. 2 (Oracle) that can instantly obtain the state-action value function
for any point in $\mathcal{S}$. The algorithm creates a uniform grid of $n$ states, such that
the distance between adjacent states is $2\rho = \frac{1}{n^{1/d}}$ and so can cover $\mathcal{S}$ with
spheres $B(s, \rho)$. Due to Assumption 3, the error in the action values of any state
in sphere $B(s, \rho)$ of state $s$ will be bounded by $L\left(\frac{1}{2n^{1/d}}\right)^\alpha$. Thus, the resulting
policy will be $L\left(\frac{1}{2n^{1/d}}\right)^\alpha$-improved, i.e. this will be the maximum regret it will
suffer over the maximally improved policy.



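To bound this regret by $\epsilon$, it is sufficient to have $n = \left(\frac{1}{2}\left(\frac{L}{\epsilon}\right)^{1/\alpha}\right)^d$ states in the
grid. The following proposition follows directly.

Proposition 5.1 Algorithm 2 results in regret $\epsilon$ for $n = O\left(L^{d/\alpha}\left[2\epsilon^{1/\alpha}\right]^{-d}\right)$.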
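The relation between the target regret and the grid size can be checked numerically. The sketch below assumes the regret bound $L\rho^\alpha$ with $\rho = 1/(2n^{1/d})$ from this section; the function names `grid_size` and `grid_regret` are hypothetical, and the per-axis count is rounded up so that $n$ is a perfect $d$-th power.

```python
import math


def grid_size(L: float, alpha: float, d: int, eps: float) -> int:
    """Number of grid states n = m**d so that L * rho**alpha <= eps,
    where m = ceil(0.5 * (L / eps)**(1/alpha)) points are used per axis."""
    per_axis = max(1, math.ceil(0.5 * (L / eps) ** (1.0 / alpha)))
    return per_axis ** d


def grid_regret(L: float, alpha: float, d: int, n: int) -> float:
    """Regret bound L * rho**alpha for a uniform grid of n states."""
    rho = 1.0 / (2.0 * n ** (1.0 / d))
    return L * rho ** alpha
```

For instance, `grid_size(1.0, 1.0, 2, 0.25)` gives a 2-by-2 grid, and `grid_regret` on that grid returns the target 0.25. The exponential dependence on `d` is visible immediately when `d` is increased.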



^3 To see this, consider $\mathcal{S} = [0, 1]$, with some $s^*$: $R(s, a_1) \ge R(s, a_2)\ \forall s \ge s^*$ and $R(s, a_1) <
R(s, a_2)\ \forall s < s^*$. Finding $s^*$ requires a binary search, at best.
Furthermore, as for all $s$ such that $\Delta^\pi(s) > L\rho^\alpha$, $a^*_{s,\pi}$ will be the improved action
in all of $B(s, \rho)$, then $\pi'$ will be improving in $S$ with $\mu\{S\} \ge 1 - M\left(L\rho^\alpha\right)^\beta$, where $\rho = \frac{1}{2n^{1/d}}$.
Both the regret and the lack of complete coverage are due to the fact that we
cannot estimate the best-action boundaries with arbitrary precision in finite
time. When using rollout sampling, however, even if we restrict ourselves to $\epsilon$
improvement, we may still make an error due to both the limited number of
rollouts and the finite horizon of the trajectories. In the remainder, we shall
derive error bounds for two practical algorithms that employ a fixed grid with
a finite number of $T$-horizon rollouts.

5.3 Error bounds for states 

When we estimate the value function at each $s \in \hat{S}$ using rollouts there is a
probability that the estimated best action $\hat{a}^*_{s,\pi}$ is not in fact the best action.
For any given state under consideration, we can apply the following well-known
lemma to obtain a bound on this error probability.

Lemma 5.2 (Hoeffding inequality) Let $X$ be a random variable in $[b, b+Z]$
with $\bar{X} \triangleq E[X]$, observed values $x_1, \ldots, x_n$ of $X$, and $\hat{X}_n \triangleq \frac{1}{n}\sum_{i=1}^{n} x_i$. Then,
$P(\hat{X}_n \ge \bar{X} + \epsilon) \le \exp(-2n\epsilon^2/Z^2)$ and $P(\hat{X}_n \le \bar{X} - \epsilon) \le \exp(-2n\epsilon^2/Z^2)$ for any $\epsilon > 0$.

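Without loss of generality, consider two random variables $X, Y \in [0, 1]$, with
empirical means $\hat{X}_n, \hat{Y}_n$ and empirical difference $\hat{\Delta}_n \triangleq \hat{X}_n - \hat{Y}_n > 0$. Their
means and difference will be denoted as $\bar{X}, \bar{Y}, \bar{\Delta} \triangleq \bar{X} - \bar{Y}$ respectively.

Note that if $\bar{X} > \bar{Y}$, $\hat{X}_n > \bar{X} - \bar{\Delta}/2$ and $\hat{Y}_n < \bar{Y} + \bar{\Delta}/2$ then necessarily
$\hat{X}_n > \hat{Y}_n$, so $P(\hat{X}_n > \hat{Y}_n \mid \bar{X} > \bar{Y}) \ge P(\hat{X}_n > \bar{X} - \bar{\Delta}/2 \wedge \hat{Y}_n < \bar{Y} + \bar{\Delta}/2)$. The
converse is

$$P(\hat{X}_n < \hat{Y}_n \mid \bar{X} > \bar{Y}) \le P(\hat{X}_n < \bar{X} - \bar{\Delta}/2 \vee \hat{Y}_n > \bar{Y} + \bar{\Delta}/2) \qquad (3a)$$
$$\le P(\hat{X}_n < \bar{X} - \bar{\Delta}/2) + P(\hat{Y}_n > \bar{Y} + \bar{\Delta}/2) \qquad (3b)$$
$$\le 2\exp\left(-\frac{n}{2}\bar{\Delta}^2\right). \qquad (3c)$$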

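Now, consider $\hat{a}^*_{s,\pi}$ such that $\hat{Q}^\pi(s, \hat{a}^*_{s,\pi}) \ge \hat{Q}^\pi(s, a)$ for all $a$. Setting $\hat{X}_n =
Z^{-1}\hat{Q}^\pi(s, \hat{a}^*_{s,\pi})$ and $\hat{Y}_n = Z^{-1}\hat{Q}^\pi(s, a)$, where $Z$ is a normalising constant such
that $Q \in [b, b+1]$, we can apply (3). Note that the bound is largest for the action
$a'$ with value closest to $a^*_{s,\pi}$, for which it holds that $Q^\pi(s, a^*_{s,\pi}) - Q^\pi(s, a') =
\Delta^\pi(s)$. Using this fact and an application of the union bound, we conclude that
for any state $s$, from which we have taken $c(s)$ samples, it holds that:

$$P\left[\exists\, \hat{a}^*_{s,\pi} \ne a^*_{s,\pi} : \hat{Q}^\pi(s, \hat{a}^*_{s,\pi}) \ge \hat{Q}^\pi(s, a)\right] \le 2|\mathcal{A}| \exp\left(-\frac{c(s)}{2}\Delta^\pi(s)^2\right). \qquad (4)$$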
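To get a feel for the magnitudes in bound (4), the sketch below evaluates its right-hand side and inverts it for the number of samples. It assumes action values already normalised to a unit range, and the function names `misranking_bound` and `samples_for_confidence` are hypothetical.

```python
import math


def misranking_bound(margin: float, c: int, n_actions: int) -> float:
    """Right-hand side of (4): 2|A| exp(-c * margin**2 / 2), capped at 1."""
    return min(1.0, 2 * n_actions * math.exp(-0.5 * c * margin ** 2))


def samples_for_confidence(margin: float, n_actions: int, delta: float) -> int:
    """Smallest c for which the bound above is at most delta."""
    return math.ceil(2.0 * math.log(2 * n_actions / delta) / margin ** 2)
```

With two actions, a margin of 0.5 and $\delta = 0.05$, this gives 36 samples per action; halving the margin roughly quadruples that number, which is why states with small margins dominate the sampling cost.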



5.4 Uniform sampling: the FIXED algorithm 

Algorithm 3 Fixed

Input: $n$, $\pi$, $c$, $T$, $\delta$
Set $\hat{S}$ to a uniform grid of $n$ states in $\mathcal{S}$.
for $s \in \hat{S}$ do
    Estimate $\hat{Q}^{\pi,T}(s, a)$ for all $a$.
    if $\hat{\Delta}^\pi(s) \ge \sqrt{\frac{2\log(2n|\mathcal{A}|/\delta)}{c}}$ then
        $\hat{a}^*_{s,\pi} = \arg\max_a \hat{Q}^{\pi,T}(s, a)$
    else
        $\hat{a}^*_{s,\pi} = \pi(s)$
    end if
end for
return $\hat{A}^*_{\hat{S},\pi} \triangleq \{\hat{a}^*_{s,\pi} : s \in \hat{S}\}$

As we have seen in the previous section, if we employ a grid of $n$ states, covering
$\mathcal{S}$ with spheres $B(s, \rho)$, where $\rho = \frac{1}{2n^{1/d}}$, and taking action $a^*_{s,\pi}$ in each sphere
centred in $s$, then the resulting policy $\pi'$ is only guaranteed to be improved
within $\epsilon$ of the optimal improvement from $\pi$, where $\epsilon = L\rho^\alpha$. Now, we examine
the case where, instead of obtaining the true $a^*_{s,\pi}$, we have an estimate $\hat{a}^*_{s,\pi}$
arising from $c$ samples from each action in each state, for a total of $cn|\mathcal{A}|$
samples. Algorithm 3 accepts (i.e. it sets $\hat{a}^*_{s,\pi}$ to be the empirically highest
value action in that state) for all states satisfying:



$$\hat{\Delta}^\pi(s) \ge \sqrt{\frac{2\log(2n|\mathcal{A}|/\delta)}{c}}. \qquad (5)$$

The condition ensures that the probability that $Q^\pi(s, \hat{a}^*_{s,\pi}) < Q^\pi(s, a^*_{s,\pi})$,
meaning the optimally improving action is not $\hat{a}^*_{s,\pi}$, at any state is at most $\delta$. This
can easily be seen by substituting the right hand side of (5) for the margin in (4). As
$\hat{\Delta}^\pi(s) > 0$, this results in an error probability of a single state smaller than $\delta/n$
and we can use a union bound to obtain an error probability of $\delta$ for each policy
improvement step.
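The fixed allocation scheme with acceptance rule (5) might be sketched as follows. This is only an illustration: `sample_q(s, a)` stands for a hypothetical callable returning one rollout return already normalised to $[0, 1]$, `fixed_allocation` is an invented name, and at least two actions are assumed.

```python
import math
from typing import Callable, Dict, Hashable, Sequence


def fixed_allocation(grid: Sequence[Hashable], actions: Sequence[Hashable],
                     sample_q: Callable[[Hashable, Hashable], float],
                     current_policy: Callable[[Hashable], Hashable],
                     c: int, delta: float) -> Dict[Hashable, Hashable]:
    """Spend c rollouts per action at every grid state, then accept the
    empirically best action only where the empirical margin passes (5)."""
    n = len(grid)
    threshold = math.sqrt(2.0 * math.log(2 * n * len(actions) / delta) / c)
    chosen = {}
    for s in grid:
        # Empirical action values from c rollouts per action.
        q_hat = {a: sum(sample_q(s, a) for _ in range(c)) / c for a in actions}
        ranked = sorted(actions, key=lambda a: q_hat[a], reverse=True)
        margin = q_hat[ranked[0]] - q_hat[ranked[1]]
        # Fall back to the current policy when the margin is not significant.
        chosen[s] = ranked[0] if margin >= threshold else current_policy(s)
    return chosen
```

The total cost is always `c * len(grid) * len(actions)` rollouts, regardless of how easy or hard individual states turn out to be.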

For each state $s \in \hat{S}$ that the algorithm considers, the following two cases are
of interest: (a) $\Delta^\pi(s) < \epsilon$, meaning that even when we have correctly identified
$a^*_{s,\pi}$, we are still not improving over all of $B(s, \rho)$ and (b) $\Delta^\pi(s) \ge \epsilon$.

While the probability of accepting the wrong action is always bounded by
$\delta$, we must also calculate the probability that we fail to accept an action at
all, when $\Delta^\pi(s) \ge \epsilon$, to estimate the expected regret. Restating our acceptance
condition as $\hat{\Delta}^\pi(s) \ge \theta$, this is given by:

$$P[\hat{\Delta}^\pi(s) < \theta] = P[\hat{\Delta}^\pi(s) - \Delta^\pi(s) < \theta - \Delta^\pi(s)]$$
$$= P[\Delta^\pi(s) - \hat{\Delta}^\pi(s) > \Delta^\pi(s) - \theta], \quad \Delta^\pi(s) > \theta. \qquad (6)$$

Is $\Delta^\pi(s) > \theta$? Note that for $\Delta^\pi(s) > \epsilon$, if $\epsilon > \theta$ then so is $\Delta^\pi$. So, in order
to achieve total probability $\delta$ for all state-action pairs in this case, after some
calculations, we arrive at this expression for the regret



By equating the two sides, we get an expression for the minimum number of
samples necessary per state:

$$c = \frac{2^{2\alpha+1}}{L^2}\, n^{2\alpha/d} \log(2n|\mathcal{A}|/\delta).$$
Algorithm 4 Count

Input: $n$, $\pi$, $C$, $T$, $\delta$
Set $\hat{S}_0$ to a uniform grid of $n$ states in $\mathcal{S}$, $c_1, \ldots, c_n = 0$.
for $k = 1, 2, \ldots$ do
    for $s \in \hat{S}_{k-1}$ do
        Estimate $\hat{Q}^{\pi,T}(s, a)$ for all $a$, increment $c(s)$
    end for
    $\hat{S}_k = \left\{s \in \hat{S}_{k-1} : \hat{\Delta}^\pi(s) < \sqrt{\frac{2\log(2n|\mathcal{A}|/\delta)}{c(s)}}\right\}$
    if $\sum_s c(s) \ge C$ then
        Break.
    end if
end for

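This directly allows us to state the following proposition.

Proposition 5.2 The sample complexity of Algorithm 3 to achieve regret at
most $\epsilon$ with probability at least $1 - \delta$ is $O\left(\epsilon^{-2} L^{d/\alpha}\left[2\epsilon^{1/\alpha}\right]^{-d} \log\left(\frac{2|\mathcal{A}|}{\delta} L^{d/\alpha}\left[2\epsilon^{1/\alpha}\right]^{-d}\right)\right)$.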

5.5 The Count algorithm 

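The Count algorithm starts with a policy $\pi$ and a set of states $\hat{S}_0$, with $n = |\hat{S}_0|$.
At each iteration $k$, each state still under consideration is sampled once. Once a state
contains a dominating action, it is removed from the search. So,

$$\hat{S}_k = \left\{s \in \hat{S}_{k-1} : \hat{\Delta}^\pi(s) < \sqrt{\frac{2\log(2n|\mathcal{A}|/\delta)}{c(s)}}\right\}.$$

Thus, the number of samples from each state is $c(s) \ge k$ if $s \in \hat{S}_k$.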

We can apply similar arguments to analyse Count, by noting that the
algorithm spends less time in states with higher $\Delta^\pi$ values. The measure assumption
then allows us to calculate the number of states with large $\Delta^\pi$ and thus, the
number of samples that are needed.
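The counting scheme might be sketched as follows. As before, `sample_q(s, a)` is a hypothetical callable returning one rollout return normalised to $[0, 1]$, `count_allocation` is an invented name, at least two actions are assumed, and the budget counts state visits (one rollout per action each), mirroring the stopping rule on the sum of $c(s)$.

```python
import math
from typing import Callable, Dict, Hashable, Sequence, Tuple


def count_allocation(grid: Sequence[Hashable], actions: Sequence[Hashable],
                     sample_q: Callable[[Hashable, Hashable], float],
                     budget: int, delta: float
                     ) -> Tuple[Dict[Hashable, Hashable], Dict[Hashable, int]]:
    """Sample every active state once per round and drop a state as soon as
    its empirical margin exceeds the count-dependent threshold."""
    n = len(grid)
    log_term = 2.0 * math.log(2 * n * len(actions) / delta)
    sums = {s: {a: 0.0 for a in actions} for s in grid}
    counts = {s: 0 for s in grid}
    active = list(grid)
    accepted = {}
    while active and sum(counts.values()) < budget:
        still_active = []
        for s in active:
            for a in actions:
                sums[s][a] += sample_q(s, a)  # one more rollout per action
            counts[s] += 1
            ranked = sorted(actions, key=lambda a: sums[s][a], reverse=True)
            margin = (sums[s][ranked[0]] - sums[s][ranked[1]]) / counts[s]
            if margin >= math.sqrt(log_term / counts[s]):
                accepted[s] = ranked[0]       # dominating action found
            else:
                still_active.append(s)
        active = still_active
    return accepted, counts
```

In a toy run with one state whose margin is 0.8 and one state whose actions are indistinguishable, the first state is resolved after 14 visits and the remaining budget is spent entirely on the second.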

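We have already established that there is an upper bound on the regret
depending on the grid resolution $\epsilon \le L\rho^\alpha$. We proceed by forming subsets of
states $W_m = \{s \in \hat{S} : \Delta^\pi(s) \in [2^{-m}, 2^{1-m})\}$. Note that we only need to consider
$m < 1 + \frac{1}{\log 1/2}(\log L + \alpha \log \rho)$.

Similarly to the previous algorithm, and due to our acceptance condition,
for each state $s \in W_m$, we need $c(s) \ge 2^{2m+1} \log\frac{2n|\mathcal{A}|}{\delta}$ to bound the
total error probability by $\delta$. The total number of samples necessary is

$$\sum_{m=0}^{\left\lceil \frac{1}{\log 1/2}(\log L + \alpha \log \rho) \right\rceil} |W_m|\, 2^{2m+1} \log\frac{2n|\mathcal{A}|}{\delta}.$$

A bound on $|W_m|$ is required to bound this expression. Note that

$$\mu\{B(s, \rho) : \Delta^\pi(s') < \epsilon\ \forall s' \in B(s, \rho)\} \le \mu\{s : \Delta^\pi(s) < \epsilon\} < M\epsilon^\beta. \qquad (8)$$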
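It follows that $|W_m| < M 2^{\beta(1-m)} \rho^{-d}$ and consequently

$$\sum_{s \in \hat{S}} c(s) = \log\frac{2n|\mathcal{A}|}{\delta} \sum_{m=0}^{\bar{m}} M 2^{\beta(1-m)} \rho^{-d}\, 2^{2m+1}
\le M 2^{\beta+1}\, 2^{d}\, n \log\frac{2n|\mathcal{A}|}{\delta} \sum_{m=0}^{\bar{m}} 2^{(2-\beta)m}, \qquad (9)$$

where $\bar{m} = \left\lceil \frac{1}{\log 1/2}(\log L + \alpha \log \rho) \right\rceil$ and $\rho^{-d} = 2^d n$.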

The above results directly in the following proposition:

Proposition 5.3 The sample complexity of Algorithm 4 to achieve regret at
most $\epsilon$ with probability at least $1 - \delta$ is $O\left(\epsilon^{\beta-2} L^{d/\alpha}\left[2\epsilon^{1/\alpha}\right]^{-d} \log\left(\frac{2|\mathcal{A}|}{\delta} L^{d/\alpha}\left[2\epsilon^{1/\alpha}\right]^{-d}\right)\right)$.

We note that we are of course not able to remove the dependency on $d$, which
is only due to the use of a grid. Nevertheless, we obtain a reduction in sample
complexity of order $\epsilon^\beta$ for this very simple algorithm.



6 Discussion 

We have derived performance bounds for approximate policy improvement
without a value function in continuous MDPs. We compared the usual approach of
sampling equally from a set of candidate states to the slightly more sophisticated
method of sampling from all candidate states in parallel, and removing a
candidate state from the set as soon as it was clear which action is best. For
the second algorithm, we find an improvement of approximately $\epsilon^{-\beta}$. Our
results complement those of Fern et al. [7] for relational Markov decision processes.
However, a significant amount of future work remains.

Firstly, we have assumed everywhere that $T > T_\epsilon$. While this may be a
relatively mild assumption for $\gamma < 1$, it is problematic for the undiscounted
case, as some states would require far deeper rollouts than others to achieve
regret $\epsilon$. Thus, in future work we would like to examine sample complexity in
terms of the depth of rollouts as well.

Secondly, we would like to extend the algorithms to increase the number
of states that we look at: whenever $V^\pi(s) \approx V^{\pi'}(s)$ for all $s$, then we could
increase the resolution. For example, if

$$P\left(\hat{V}^\pi(s) + \epsilon < \hat{V}^{\pi'}(s) \mid V^\pi(s) > V^{\pi'}(s)\right) < \delta$$

then we could increase the resolution around those states with the smallest $\Delta^\pi$.
This would get around the problem of having to select $n$.

A related point that has not been addressed herein is the choice of policy
representation. The grid-based representation probably makes poor use of the
available number of states. For the increased-resolution scheme outlined above,
a classifier such as $k$-nearest-neighbour could be employed. Furthermore,
regularised classifiers might impose a smoothing property on the resulting policy, and
allow the learning of improved policies from a set of states containing erroneous
best action choices.

As far as the state allocation algorithms are concerned, in a companion
paper [4], we have compared the performance of Count and Fixed with
additional allocation schemes inspired by the UCB and successive elimination
algorithms. We have found that all methods outperform Fixed in practice,
sometimes by an order of magnitude, with the UCB variants being the best
overall.

For this reason, in future work we plan to perform an analysis of such
algorithms. A further extension to deeper searches, by for example managing the
sampling of actions within a state, could also be performed using techniques
similar to [8].